2) Configural polysampling, so far the only direct approach to robustness in actual
finite-sized  samples. 

3)  Compromise  MLEs  as  a  more  flexible,  often  closely  approximate,  approach  to 
robustness. 

4)  Calculations  for  many  explicit  randomizations  as  the  trustworthy  and  stringent 
analysis  for  randomized  experiment. 
and E. L. Reznikova, under the direction of B. F. Pisarenko, Mir
Press, Moscow, USSR, 693 pages.
Understanding Robust and Exploratory Data Analysis, John Wiley &
Sons,  Inc.,  New  York. 
Kafadar, Karen (1982b). "Using biweight M-estimates in the two-sample problem
Also Technical Reports Nos. 195, 252, 253, 254, 255.

Pregibon,  Daryl  (1980a).  "Applications  of  resistant  fitting  to  a  class  of  nonlinear 
42: 138-139.
5. Broadbrush account of work: 1974-1986


The work of this project has been mainly focused on "procedures" -- on things to do
with data -- on data-analysis techniques. The main heads under which we shall mention, in
broad outline, what has been done, are six:

a)  exploratory  and  graphical  procedures 

b)  robust/resistant  procedures 

c)  regression  procedures 

d) analysis-of-variance procedures, including factorialization and multiple comparisons

e)  spectrum  analysis  procedures 

f)  randomization  in  experimentation. 

(These  heads  include  a  large  fraction  of  the  most  important  procedures  that  are  applied  to 
data.  In  the  cases  of  (a),  (b),  (d)  and  (e),  at  least,  where  we  stood  at  the  close  of  1973  had  been 
significantly  influenced  by  work  earlier  at  Princeton  involving  the  principal  investigator.)  The 
subsections  that  then  follow  mention  a  diversity  of  topics. 

*  general  techniques  and  philosophy  * 

With interests as broad as those just sketched, it would have been surprising if the
work of the project had not included some general accounts, crossing over the subdivisions just
noted. We sketch them here.

In 1979-80, C. L. Mallows and John Tukey prepared a rather general account of data
analysis, including several novel ideas and concepts (Tukey (with CLM) 1982j; also published
as Tukey (with CLM) 1982l). In 1980, John Tukey presented an account of styles of data
analysis (Tukey 1982m). In 1981, he discussed the future as the keynote speaker at the 14th
meeting of the "Interface" (Tukey 1982b).

In  1983,  he  discussed  the  relation  between  empirical  (i.e.  exploratory)  analysis  and 
narrowly  modelled  analysis  (Tukey  1983a). 


In 1985, he discussed a variety of general issues in the form of a "Sunset salvo" (Tech.
Rep. 288, Tukey 1986a).

5a.  Graphical  and  exploratory  procedures 

The connection between graphical and exploratory procedures is, and will remain, close.
Besides their try-it-and-see attitude and their desire to see what might be so -- rather than
being limited to what can be shown beyond a more or less reasonable doubt -- exploratory
procedures are intended to help us see things that we did not expect to be there. In this task,
graphical presentation has been our mainstay. Tables of preplanned numbers, or of formulas,
are not good tools to reveal the unanticipated. Pictures can do just that.

Equally,  graphic  displays  encourage  exploration. 

In the relatively near future, we can expect the rapidly falling costs of computation, and
the equally rapid expansion in the diversity of ways in which computers can look at data, to make
computer-scanning an important input alongside human scanning of pictures. But the
computer’s report of what it has dredged up is almost certain to be in the form of a picture.

*  structure  * 

We  shall  mention  work  in  this  area  under  six  heads: 

•  graphical  techniques  per  se 

•  the  pre-book  and  book  phases  of  EDA 

•  pushback 

•  other  exploratory  techniques 

•  cognostics

•  interactive  analysis 

*  graphical  techniques  per  se  * 

(This work appears only in the later part of the 12 years covered by this report, since
earlier work at Princeton had a different sponsor.)

In  1979-80,  P.  A.  Tukey  and  John  Tukey  prepared,  gave  and  published  a  3-lecture  series 
on  graphical  methods  for  interpreting  data  in  3  or  more  dimensions  (Tukey  (with  PAT) 
1981e).  These  chapters  incorporated  a  number  of  novel  approaches. 

In  1981-82,  they  prepared,  presented  and  published  an  account  focused  on  4-dimensional 
data  (Tukey  (with  PAT)  1982c). 

In  1983-85,  D.  C.  Hoaglin  and  John  Tukey  prepared  and  published  an  account  of 
graphical  techniques  for  comparing  observed  sequences  of  counts  with  the  standard  discrete 
distributions,  proposing  a  number  of  new  approaches  (Tukey  (with  DCH)  1983d). 

During  1984  and  1985  John  Tukey  and  P.  A.  Tukey  developed  three  sorts  of  "frames" 
(using  wooden  strips,  transparencies,  and  linking  devices)  to  allow  dynamic  views  to  be 
projected  with  an  ordinary  overhead  projector.  Both  layered  skewing  and  alternation  can  be 
shown,  each  by  an  appropriate  device  (unpublished,  but  publicly  demonstrated). 

In  1985  they  prepared,  presented  and  published  an  introduction  to  "Computer  Graphics 
and  Exploratory  Data  Analysis"  (Tukey  (with  PAT)  1985k). 

In 1985, Eugene Johnson and John Tukey prepared an account of an exploratory,
graphical approach to factorial (i.e., 2 or more coordinates in all combinations) data -- to the
sort of data usually treated in terms of the classical analysis of variance (Tukey (with EJ)
1986***).

*  Exploratory  Data  Analysis  * 

During 1974-76, much effort was applied to new techniques of exploratory data analysis.
These were mainly reported in the First Edition of John Tukey’s Exploratory Data Analysis
(Tukey 1977a).

Applications  to  demography  were  reported  by  Mary  Breckenridge  (Tech.  Rep.  143)  and, 
later,  a  book,  (Breckenridge  1983),  with  an  appendix  by  John  Tukey,  (Tukey  1983a). 


Applications to semi-Markov process data were studied by Henry Braun, leading to
regression-like analysis of birth intervals (Braun 1980b).

Applications  to  voting  behavior  were  studied  by  J.  L.  McCarthy  and  John  Tukey  (Tukey 
(with  JLM)  1978c). 

*  pushback  * 

The use of "pushback", in which individual observed values are modified in the light of
the presence of other values, has been considered at intervals throughout the twelve years
1974-86.

The results of early work by John Tukey on the question of "deblurring" an observed
distribution -- doing one’s best to eliminate the effects of, say, measurement error -- were
published in 1974 (Tukey 1974g).

Work, in 1974-75, by L. F. Nanni and John Tukey focussed on two uses -- as a
contribution to robust centering and as a route to a plot that would do better what a
conventional probability plot would do (unpublished).

In 1980-81, Katherine (Bell) Krystinik carried through a study of the first use, finding
that simple summaries of pushed-back values showed high performance, and preparing a
Ph.D. Thesis.

In  1986,  John  Tukey  returned  to  this  topic,  and  suggested  a  variety  of  promising 
modifications  (unpublished). 

*  other  exploratory  techniques  * 

Over a period of time, D. C. Hoaglin, B. Iglewicz and John Tukey have studied the
quantitative null behavior, for both Gaussian and long-tail parents, of the "fence-and-outside"
labelling procedure proposed in Exploratory Data Analysis (Tukey (with DCH, BI) 1981a
and 1986**).
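As a rough illustration of the kind of labelling rule under study, the sketch below labels each value in a batch as "inside", "outside" or "far out" according to fences placed beyond the quartiles. It is only a sketch under stated assumptions: it uses NumPy's default quartile interpolation in place of the hinges defined in Exploratory Data Analysis, it takes the customary multipliers 1.5 and 3 as defaults, and the function name `label_outside` is hypothetical.

```python
import numpy as np

def label_outside(values, inner=1.5, outer=3.0):
    """Label each value relative to fences built from the quartile spread.

    Assumption: quartiles from np.percentile stand in for EDA hinges.
    """
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    spread = q3 - q1
    labels = []
    for v in x:
        # Distance beyond the nearer quartile (zero if between the quartiles).
        dist = max(q1 - v, v - q3, 0.0)
        if dist > outer * spread:
            labels.append("far out")
        elif dist > inner * spread:
            labels.append("outside")
        else:
            labels.append("inside")
    return labels

batch = [2.1, 2.4, 2.5, 2.7, 2.9, 3.0, 3.2, 4.5, 9.0]
labels = label_outside(batch)
```

Studying the null behavior of such a rule amounts to asking how often values are labelled "outside" when the batch really comes from a single Gaussian or long-tailed parent; that can be estimated by applying the function to many simulated batches.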

In 1985-6, Katherine Hansen and John Tukey studied more sophisticated approaches to
clustering -- to dividing a given set of points into "clusters" on the basis of their mutual
distances. By avoiding barriers of required simplicity, it has been possible to greatly improve
the sensitivity of such a procedure. Present techniques separate moderately overlapping
Gaussian distributions almost as well as would the linear discriminants that could be
calculated only if the true distributions were known. Work continues (unpublished, some
aspects reported in a talk to the North American Classification Society).

*  cognostics  * 

Computer-interpreted diagnostics seem almost certain to be the handmaiden and
supplement to graphical display.

Attention to cognostics was first given in 1981 (Tukey 1981u). A set of plausible
suggestions was made in 1985 (Tukey (with PAT) 1985k).

*  interactive  analysis  * 

Motivated  by  the  need  to  display  effectively  what  has  already  been  tried  on  a  data  set, 
John  Tukey  has  prepared  several  drafts  of  an  account  of  what  operations  might  be  provided 
and  how  the  history  of  their  use  might  be  displayed  (work  in  progress). 

5b. Robust/resistant procedures

At the close of 1973, the Princeton Robustness Study (*Tukey (with DFA, PJB, FRH,
PJH, WHR) 1972c) had circulated for over a year. Very considerable progress had been made
on the simplest problem: estimating a "center" from a single batch of numbers in a robust --
and consequently, it would seem, resistant -- way. Most of the improvements found in the
study can be traced to the results of empirical trials, either directly or through careful
thought applied to understanding these results. The twelve years that have followed at
Princeton have seen (i) an extension of the procedures to a variety of problems, (ii) the
development of configural polysampling, which enables us to bound the possible (in finite-sized
samples), and (iii) recent asymptotic developments, which promise a new era of broadening of
applications.


At  the  close  of  1973,  the  early  beginnings  of  robust  (non-linear)  smoothing  were  in 
place.  (The  start  may  have  been  in  the  Limited  Preliminary  Edition  of  Exploratory  Data 
Analysis  (*Tukey  1971a).) 

We  shall  review  project  activity  here  under  8  heads: 

•  robust (non-linear) smoothing (1974-1986)

•  extended  problems  (1975-80) 

•  empirically-based  improvement  techniques  (1977-76) 

•  uses  of  order  statistics  (1979-83) 

•  robust  shape  comparisons  (1980-83) 

•  configural  polysampling  (1980-86) 

•  new problem types (1982-86)

•  asymptotic  improvement  techniques  (1983-86) 

In addition to what is reported here in 5b, large parts of what is reported below under
5c (regression), 5d (analysis of variance), and 5e (spectrum analysis) involve the use of
robust/resistant concepts so deeply as to fit in here in 5b, had we wished. And the importance
of the procedures in 5f stems from the possibility of using robust/resistant summaries. We
have chosen the actual structure, however, in order to keep the story as simple as we know
how.

The relation to 5a is also closer than one might think. While the emphases are quite
different -- "finding appearances" in 5a, "being stringent in diverse circumstances" here --
good examples of exploratory procedure have to have a good dose of robustness, even though
extreme high stringency usually need not be sought. (As we move to deeper involvement
with modern computers, greater stringency that involves no other cost than computation is
more and more likely to be seized upon.)

This latter convergence is to be seen rather clearly in the combined, and on occasion even
interlaced, treatment of exploratory and robust techniques in the two recent books --
Understanding Robust and Exploratory Data Analysis and Exploring Data Tables, Trends,
and Shapes -- edited by D. C. Hoaglin, F. Mosteller, and John Tukey (Tukey (with DCH and
FM) 1983b, 1985b).


*  robust  (non-linear)  smoothing  * 

Smoothing  by  moving  linear  combinations  was  the  classical  form  of  smoothing,  suffering 
from  (i)  too  much  attention  to  outliers  and  (ii)  filling  in  valleys  and  cutting  down  hills.  Use 
of  non-linear  moving  combinations  can  ameliorate  both  these  difficulties.  It  still  does 
"smoothing  by  value-change,"  giving  us  a  smoothed  value  in  place  of  each  initial  value.  In 
certain  applications,  there  is  use  for  a  different  process  -  -  "smoothing  by  excision"  -  -  in  which 
we  set  aside  some  of  the  initially  given  values. 
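The contrast between linear and non-linear moving combinations can be sketched with the simplest possible pair: a moving average of 3 and a running median of 3. This is only an illustration, not one of the project's smoothers; the function names are hypothetical, and end values are simply copied unchanged as a simplifying assumption.

```python
import numpy as np

def moving_average3(y):
    """Classical linear smoother: mean of each value and its two neighbours."""
    y = np.asarray(y, dtype=float)
    out = y.copy()                      # end values copied (assumption)
    out[1:-1] = (y[:-2] + y[1:-1] + y[2:]) / 3.0
    return out

def running_median3(y):
    """Non-linear smoother: median of each value and its two neighbours."""
    y = np.asarray(y, dtype=float)
    out = y.copy()                      # end values copied (assumption)
    out[1:-1] = np.median(np.stack([y[:-2], y[1:-1], y[2:]]), axis=0)
    return out

series = [1.0, 2.0, 3.0, 40.0, 5.0, 6.0, 7.0]   # one wild value at position 3
avg = moving_average3(series)
med = running_median3(series)
```

In this example the single wild value contaminates three consecutive moving averages, while the running median replaces it with a neighbouring value and leaves the rest of the sequence essentially alone. Both are "smoothing by value-change": each initial value gets a smoothed value in its place.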

The use of robust smoothing as a way to robustify the fitting of polynomials was
developed and illustrated by A. E. Beaton and John Tukey (Tukey (with AEB) 1974c).
Improved techniques of robust smoothing were developed, and reported in the first edition of
Exploratory Data Analysis (Tukey 1977a).

The  role  of  "head  banging"  as  a  fundamental  concept  in  the  construction  of 
robust/resistant  smoothers  was  recognized  in  1978  and  extended  to  smoothing  in  the  plane  (cp. 
Tukey  (with  PAT)  1981e). 

Re-expression of the numerical responses (or numerical circumstances) we are analyzing
-- as, in the simplest case, by taking logarithms -- is often important. We would usually like
to have the choice made robustly. In many situations it is natural to "guide" the re-expression
by a rank-related version of the numbers at hand. Typically we then want to be guided in
the large-scale behavior of the result, but to reflect the small-scale behavior of the original
numbers. A procedure for doing this, emphasizing smoothing by excision and based on starting
to smooth divided differences (that connect initial values to guiding values), is called smelting
and leads to pictures which can guide the choice of simple-formula re-expressions (Tukey
1982n).

Recently,  John  Tukey  has  been  reviewing  and  extending  the  available  robust  smoothing 
techniques  (Technical  Report  in  preparation). 

*  extended  problems  * 

In  the  latter  half  of  the  70’s,  the  procedures,  and  the  background  leading  to  their  choice, 
for  the  one-sample  problem  of  center  finding  were  extended  to  other  related  problems. 

The problem of robust/resistant assessment of width of distribution was addressed in
1975-77 (Tech. Rep. 129; Tukey (with FM) 1977b). Further small improvements were made
later (unreported).

Work  directed  toward  improved  empirical  assessment  of  stability  for  widthers  was 
begun  in  1977  and  then  interrupted. 

The extension of one-sample robust/resistant estimates to two-sample comparisons
(differences) of centers was successfully undertaken in 1978-79 by Karen Kafadar (Ph.D.
Thesis 1979; Tech. Reps. 152, 154; Kafadar 1982a, 1982b, 1985). She also showed that both the
interval estimates proposed by Gross in the one-sample case and her intervals for differences
still behaved very well for confidence coefficients very close to unity (tail areas like 0.01%).

An  approach  to  robust  correlation  based  on  robust  widthing  was  suggested  in  the  book 
by  F.  Mosteller  and  John  Tukey  (Tukey  (with  FM)  1977b). 

The  bias  of  repeated-median  correlation  estimation  was  assessed  (by  simulation)  by 
Andrew  Siegel  in  1980  (unpublished). 

For extensions to regression see 5c and for extensions to analysis of variance see 5d.

*  empirically-based improvement techniques  *

Methods of improving the performance of robust estimators by combining two or more,
using each in its prescribed part of the configuration space, were developed and reported
(Tukey 1979c).


*  use  of  order  statistics  * 

The ninther, a technique for rapid center estimation "on the fly" when dealing with very
large samples, was developed and reported (Tukey 1978e).
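A minimal sketch of the idea follows, assuming the usual definition of the ninther of nine values as the median of the medians of three consecutive groups of three. The function name is hypothetical, and the sketch handles exactly nine values rather than a full streaming scheme.

```python
from statistics import median

def ninther(values):
    """Median of the three medians of consecutive triples (nine values)."""
    if len(values) != 9:
        raise ValueError("ninther is defined here for exactly nine values")
    triple_medians = [median(values[i:i + 3]) for i in (0, 3, 6)]
    return median(triple_medians)

# One wild value (100) does not disturb the result.
center = ninther([3, 1, 2, 100, 5, 4, 9, 7, 8])
```

Because each step looks at only three numbers, the values can be consumed as they arrive and most of them need never be stored, which is what makes the approach attractive for very large samples.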

The behavior of order statistics from the 3 original corner distributions was studied by
Andrew Bruce (senior thesis, supervised by Daryl Pregibon), who later extended his results to
2 additional stretched-tail distributions. Among linear combinations of order statistics, his
results favor the use of the so-called ab-mean (unpublished). Related material, involving the
"2nd representing function", was reported in 1981 by Andrew Bruce, Daryl Pregibon and
John Tukey (Tech. Rep. 186).

The  possibilities  of  "easy-t"  confidence  intervals  based  upon  only  2  order  statistics,  actual 
or  mid-interpolation,  for  sample  sizes  of  5  to  20  were  studied  and  reported  by  Paul  Horn 
(Ph.D.  thesis  1981;  Horn  1983,  Horn  1985;  Tech.  Rep.  229). 

*  robust  shape  comparison  * 

The comparison of shapes of geometric figures (including projections of shapes of animals)
is often better conducted when the rescalings and rotations implied by "shapes" are done
robustly. Work on such methods, using repeated medians, had been initiated by Andrew
Siegel before coming to Princeton. Extensions to the three-dimensional case were studied by
John Pinkerton (Junior Paper under Siegel’s direction; Tech. Rep. 217; Siegel and Pinkerton
1986).

Both repeated-median and least-squares methods were computerized by Siegel (Tech. Rep.
193, Siegel 1982e). The usefulness of such techniques was demonstrated on an anthropological
example by A. F. Olshan, Andrew Siegel, and D. R. Swindler (Tech. Rep. 222, Siegel (with
AFO, DRS) 1982a).

Shape and pattern matching were reviewed by R. H. Benson, R. E. Chapman and
Andrew F. Siegel (Tech. Rep. 224, Siegel (with RHB and REC) 1982b).


The  unique  decomposition  of  a  multivariate  log-normal  population  into  statistically 
independent  shape  and  size  variates  was  found  by  P.  D.  Sampson  and  Andrew  Siegel  (Siegel 
(with  PDS)  1984b,  1985d). 

*  growth  modelling  * 

Colin  Goodall  has  developed  (i)  a  growth  model  comprising  a  continuously-varying 
deformation  tensor  field  with  explicit  components  for  measurement  error,  shape  change,  and 
inter-individual  variation  and  (ii)  a  finite-element  model  involving  cell-reinforcement 
polarity,  cell-division  direction,  and  cell  deformation.  (Work  in  progress). 

*  configural  polysampling  * 

The most clearly defined estimation problems are those in which an invariance
requirement ensures that no personal prejudice or external information about which estimate
values are likely is involved in the results. (External information about other matters, such
as the shapes of distribution that are likely or possible, can, and often should, be included.) The
developments of the Princeton Robustness Study (*Tukey (with DFA, PJB, FRH, PJH, WHR)
1972e) left the finite-sample -- realistic -- study of robust estimation as something focused on
selected examples of distribution shape, often as few as two.

In the centering (location) problem, a configuration consists of all sets of values $\{y_i\}$
which differ from one another only by location and scale -- that is, all of the form $\{a + b y_i\}$
for fixed $\{y_i\}$ and any $a$ and any $b \neq 0$ (or, sometimes, any $b > 0$). Here the natural
invariance requirement is an equivariance one, namely

$$\text{estimate from } \{a + b y_i\} = a + b \cdot (\text{estimate from } \{y_i\})$$
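The equivariance requirement can be checked numerically for any candidate estimate. The sketch below is an illustration only: the helper `is_equivariant`, the tolerance, and the shrunken-mean counterexample are assumptions introduced here, not part of the project's work.

```python
import numpy as np

def is_equivariant(estimator, y, a, b, tol=1e-9):
    """Check: estimate from {a + b*y_i} == a + b * (estimate from {y_i})."""
    y = np.asarray(y, dtype=float)
    lhs = estimator(a + b * y)
    rhs = a + b * estimator(y)
    return abs(lhs - rhs) < tol

y = [0.3, 1.2, -0.7, 4.0, 2.2]

median_ok = is_equivariant(np.median, y, a=10.0, b=-2.5)
mean_ok = is_equivariant(np.mean, y, a=10.0, b=-2.5)

# A mean shrunk toward zero encodes a prior belief about where the
# center lies, and so fails the requirement.
shrunk_ok = is_equivariant(lambda v: 0.9 * np.mean(v), y, a=10.0, b=-2.5)
```

The median and the mean both pass, for negative as well as positive b, while the shrunken mean fails. That failure is exactly the kind of "external information about which estimate values are likely" that the invariance requirement is meant to exclude.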

Prior  to  the  introduction  of  configural  polysampling,  which  began  in  1980,  attempts  to 
optimize  centering  performance  for  a  particular  sample  size  and,  say,  two  particular 
distribution  shapes,  involved 



•  many  repetitions  of  inventing  plausible  estimates 

•  drawing separate sets of samples from each distribution shape,

•  evaluating  the  performance  of  each  estimate  at  each  sample  (usually  for  a  family  of 
related  samples,  but  usually  not  for  a  whole  configuration),  still  for  each  situation 
separately, 

•  plotting  the  results,  with  performance  for  each  shape  labelling  the  corresponding  axis, 

•  looking  at  the  plot,  to  guess  both  where  the  boundary  between  the  possible  and  the 
impossible  lay  and  what  kind  of  a  further  modified  estimate  would  bring  us  closer  to  the 
boundary. 

The Princeton Robustness Study and its follow-on involved this sort of trial for 700-odd
estimates (and several thousand 50-50 mixtures of estimate pairs). We still had no clear idea
where the boundary fell.

The  introduction  of  configural  polysampling,  which  began  in  1980,  made  it  possible  to 
work  with  a  single  set  of  configurations,  applicable  when  appropriately  weighted  to  each  of 
the  shapes  concerned.  (Polysampling  refers  to  a  single  set  of  samples  which,  when  used  with 
different  weights,  provide  weighted  random  samples  from  two  or  more  populations.)  Here, 
working  with  a  configuration  for  a  shape  means  evaluating,  by  numerical  integration,  a  small 
number  of  two-dimensional  integrals.  The  combination  of  (i)  these  numerical  integration 
results  and  (ii)  "shadow  prices"  for  the  shapes,  allows  us  to  determine  the  values  of  optimized 
estimates  at  each  configuration  used,  and  the  corresponding  conditional  variances,  one  per 
shape.  The  weighted  sampling  of  configurations  extends  this  result  -  -  subject  to  sampling 
error,  as  in  all  forms  of  simulation  -  -  to  the  unconditional  variances,  still  one  per  shape.  Each 
set  of  shadow  prices  lets  us  estimate  a  point  on  the  boundary  between  the  possible  and  the 
impossible.  Taking  the  shadow  prices  as  parameters,  we  can  estimate  the  whole  run  of  the 
boundary. 
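The parenthetical definition of polysampling, a single set of samples that serves two or more populations through different weights, can be sketched with ordinary importance weighting. This is only an illustration of the weighting idea under assumptions made here: it works with raw observations rather than configurations, it draws from an equal mixture of a Gaussian and a Cauchy shape, and it estimates a simple tail probability. The numerical integration and shadow-price steps described above are not attempted.

```python
import numpy as np

def gaussian_pdf(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def cauchy_pdf(x):
    return 1.0 / (np.pi * (1.0 + x**2))

rng = np.random.default_rng(0)
n = 200_000

# One set of draws from a 50-50 mixture of the two shapes (an assumption).
from_gauss = rng.random(n) < 0.5
x = np.where(from_gauss, rng.standard_normal(n), rng.standard_cauchy(n))

# Weights that turn the single sample into a weighted sample from each shape.
sampling_density = 0.5 * gaussian_pdf(x) + 0.5 * cauchy_pdf(x)
w_gauss = gaussian_pdf(x) / sampling_density
w_cauchy = cauchy_pdf(x) / sampling_density

# The same draws answer the same question for both shapes.
event = np.abs(x) > 2.0
p_gauss = np.mean(w_gauss * event)    # estimates P(|X| > 2) for the Gaussian
p_cauchy = np.mean(w_cauchy * event)  # estimates P(|X| > 2) for the Cauchy
```

Because every quantity is computed from the same draws, comparisons between the shapes are not blurred by independent sampling errors, which is the practical attraction of working with a single set.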

Work on the fundamental ideas and formulas began in 1980 and was reported by Daryl
Pregibon and John Tukey in 1981 (Tech. Rep. 185). Extensions and related matters were
reported by John Tukey (Tech. Rep. 189) and by Katherine Bell (later Krystinik) and Daryl
Pregibon (Tech. Rep. 191). An illuminating comparison of two standard M-estimates (each
both fully iterated and one-step) was made by Katherine Bell (later Krystinik) and Stephan
Morgenthaler (Tech. Rep. 195).

Michael  Cohen  worked  on  improvement  of  simple  estimates  by  regression  adjustments 
guided  by  optimum  values  obtained  by  configural  polysampling  in  1981-82  (unreported). 

John  Tukey  worked  on  other  improvement  schemes  in  1982  (unreported). 

Stephan  Morgenthaler  initiated  work  on  optimization  for  two  end  shapes  and  an 
intermediate  shape  in  1981.  This  work  was  continued  by  Michael  Cohen,  George  Grover, 
Stephan  Morgenthaler  and  John  Tukey  (not  reported  in  detail,  but  see  Tukey  1986*). 

Applications  of  double  sampling  to  improve  computational  efficiency  were  studied  by 
Stephan  Morgenthaler  and  John  Tukey  (Tech.  Rep.  252). 

For applications of configural polysampling to regression see section 5c below.

A review paper on configural polysampling was presented to the Society for Industrial
and Applied Mathematics as the 25th von Neumann lecture, and will appear in SIAM
Review (Tukey 1986*).

In  order  to  make  available  an  adequate  account  of  configural  polysampling,  Stephan 
Morgenthaler  and  John  Tukey  have  undertaken  the  editing  of  a  volume  on  this  subject,  with 
chapters  contributed  by  all  of  those  named  above. 

*  new  problem  types  * 

J. Pederson, with Daryl Pregibon’s guidance, wrote a junior paper (1980) on small-sample
properties of resistant estimates in non-location situations.

Possibilities for alternative definitions of confidence limits in a configuration-oriented
robust world were suggested by John Tukey (Tech. Rep. 190).

The actual situation for confidence intervals was investigated in some detail by Stephan
Morgenthaler (Ph.D. Thesis 1983; Tech. Reps. 253, 254, 255; Morgenthaler 1986b), who
examined confidence intervals for width (for scale) as well as for center (for location).

Robust  estimation  in  non-linear  models  is  being  studied  by  Elvezio  Ronchetti  and  S. 
Morgenthaler. 

Robust  signal  detection  has  been  studied  by  Elvezio  Ronchetti  and  M.  Weiss  (1983-4, 
unreported). 

Robust  model  selection  is  discussed  under  5c,  regression  procedures. 

Asymptotics  for  configural  location  estimates  have  been  discussed  by  Stephan 
Morgenthaler  (Morgenthaler  1985a)  and  George  Easton  (Ph.D.  Thesis  1985). 

*  advanced  improvement  techniques  * 

Optimizing asymptotic variance of an R-estimator while bounding its sensitivities to gross
errors and changes of variance was studied by Elvezio Ronchetti and J. Yen (Tech. Rep. 266,
Ronchetti and Yen 1986a).

The  relation  of  small-sample  asymptotics  to  bootstrapping  has  been  studied  by  Elvezio 
Ronchetti  (1983-85,  work  in  progress). 

The  use  of  small-sample  asymptotics  to  find  approximations  to  the  density  of  trimmed 
means  has  been  explored  by  Elvezio  Ronchetti  and  George  Easton  who  have  extended  the 
applicability  of  saddle-point  approximations  to  general  statistics,  including  linear  combinations 
of  order  statistics  (Tech.  Rep.  274,  Easton  and  Ronchetti  1986). 

George Easton has studied the asymptotic behavior of the optimal compromise estimates
for location. The approximations required led to a new class of compromise estimators
(CMLE’s) that are based on the likelihood functions of the chosen situations. Finding these
(closely approximately optimum) estimates is much simpler -- a biparameter optimization
instead of a two-dimensional numerical integration -- than finding the exactly optimum ones.
Their performance is very promising; they should be extendible to problems we do not know
otherwise how to attack (Ph.D. Thesis, 1985).

George Easton has also found that a simple modification to the likelihood function of the
slash distribution substantially improves the performance of both its (pseudo) MLE and
(pseudo) CMLE’s (Ph.D. Thesis, 1985).


*  expositions  and  weeks  * 

Robust methods were expounded for the user by John Tukey (Tukey 1979).

The teaching of robust methods was expounded by Andrew Siegel (Tech. Rep. 173, Siegel
(with RG, JK, and PAT) 1983c).

Robustness  weeks,  where  selected  active  workers  in  the  field  could  interact  effectively, 
were  held  in  Princeton  15-19  March  1980  and  4-8  May  1981. 


5c.  Regression  procedures 


Work  here  is  moderately  diverse,  and  is  discussed  under  these  heads: 


•  functional  issues 


•  repeated  medians 

•  regression  in  general  classes 

•  fitting  straight  lines 

•  strengthening  theoretical  understanding. 

A general account of some elementary, but vital, issues in regression is given in Data
Analysis and Regression (Tukey (with FM) 1977b). A variety of topics are discussed in
Exploring Data Tables, Trends, and Shapes (Tukey (with DCH, FM) 1985b).


*  functional  issues  * 


A formalized fitting (regression) problem has two parts:

•  the functional model, which describes the alternatives from which the fit is to be
chosen, and

•  the stochastic model, which describes how the error might be distributed -- best in
terms of several alternatives, each ordinarily involving at least one variability parameter.

It  will  clearly  be  increasingly  important  to  bring  into  the  functional  model  flexibility 
analogous  to  what  robust/resistant  techniques  have  put  into  the  stochastic  model. 

One way to do this is to consider what are the natural generalizations of, say, a + bx.
The answer, as John Tukey had recognized by 1979, is usually not to go to polynomials of
degree 2 or more. Rather, in most situations, other simple functions, preserving monotonicity
rather than preserving additive appearance of the constants, are a natural choice.

If we consider, for flexibility, two (or more) alternative families of possible fits, neither
including the other, it is not usually sensible to estimate a single parameter (or a single set of
parameters) for both. But it may be not only possible but desirable to ask if fitting either
(possibly both) shows a significant improvement in the remaining residuals. If the families are
rather similar, the usual (or unusual) tests for fitting either separately will be highly
correlated. Thus a Bonferroni calculation will be grossly overconservative. If we are dealing
with an experimental situation where empirical randomization (perhaps within a limited
randomization) can be used, we can easily assess the significance of any convenient
combination of the two or more tests (Tukey 1985l).
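A minimal sketch of such a randomization assessment follows. It makes several simplifying assumptions that are not in the report: the two families are "linear in x" and "linear in log x", each test statistic is the absolute correlation of the response with the corresponding carrier, the combination is their maximum, and the null situation is that the randomly assigned x levels have no effect at all. All function names are hypothetical.

```python
import numpy as np

def combined_stat(x, y):
    """Maximum of two highly correlated test statistics."""
    r_linear = abs(np.corrcoef(x, y)[0, 1])
    r_log = abs(np.corrcoef(np.log(x), y)[0, 1])
    return max(r_linear, r_log)

def randomization_p_value(x, y, n_rand=2000, seed=0):
    """Reassign the x levels at random and compare with the observed value."""
    rng = np.random.default_rng(seed)
    observed = combined_stat(x, y)
    count = 0
    for _ in range(n_rand):
        if combined_stat(rng.permutation(x), y) >= observed:
            count += 1
    # The observed assignment counts as one of the randomizations.
    return (count + 1) / (n_rand + 1)

# Illustrative data (invented for this sketch): response follows log x.
data_rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 20)
y = 3.0 * np.log(x) + data_rng.normal(scale=0.5, size=20)
p_value = randomization_p_value(x, y)
```

Because the combined statistic is recomputed under each re-randomization, the dependence between the two component tests is accounted for automatically, and no Bonferroni-style correction is needed.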

More  attention  to  such  functional  problems  is  needed. 

*  repeated  median  * 

Beginning  in  1980,  Andrew  Siegel  studied  and  reported  on  a  technique  of  repeated 
median  fitting  (Tech.  Rep.  172,  Siegel  1982c).  This  technique  is  quite  useful  for  2-parameter 
problems,  like  fitting  straight  lines,  but  rapidly  demands  heavy  computations  as  the  number 
of  parameters  increases.  The  relationship  of  repeated  median  fitting  to  other  types  of  fits  has 
been  discussed  by  I.  M.  Johnstone  and  Paul  Velleman  (Velleman  (with  IMJ)  1985). 
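For the straight-line case, the repeated-median slope is commonly described as a median, over points i, of the median over the other points j of the pairwise slopes. The sketch below follows that description; taking the intercept as the median of y - slope * x is a simplifying assumption made here, the function name is hypothetical, and distinct x values are assumed.

```python
import numpy as np

def repeated_median_line(x, y):
    """Repeated-median fit of y = intercept + slope * x (distinct x assumed)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    inner_medians = []
    for i in range(n):
        others = np.arange(n) != i
        pair_slopes = (y[others] - y[i]) / (x[others] - x[i])
        inner_medians.append(np.median(pair_slopes))   # median over j
    slope = np.median(inner_medians)                   # median over i
    intercept = np.median(y - slope * x)               # simplifying assumption
    return slope, intercept

# Exact line y = 1 + 2x with two grossly wrong responses.
x = np.arange(10.0)
y = 1.0 + 2.0 * x
y[2] = 50.0
y[7] = -30.0
slope, intercept = repeated_median_line(x, y)
```

The nested medians already cost on the order of n squared pairwise slopes for two parameters; with p parameters the nesting runs over p-tuples of points, which is the source of the heavy computation noted above.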


*  regression  in  general  classes  * 

In 1980, Daryl Pregibon prepared and gave, at Perugia, a five-lecture series on
data-analytic methods for fitting generalized linear models (unpublished).

In  1980,  Daryl  Pregibon  reported  on  techniques  for  dealing  with  logistic  regression 
problems  (Pregibon  1980a,  1980b)  and  discussed  McCullagh’s  work  on  related  topics  (Pregibon 
1980c). 

In  1983,  John  Tukey  and  Paul  Velleman  prepared  a  draft  account  of  what  a  reasonable 
computer-supported  regression  procedure  might  include  (unpublished). 

Robust  model  selection  in  regression  has  been  studied  by  Elvezio  Ronchetti  (Tech.  Rep. 
259,  Ronchetti  1985). 

A plausible, highly heuristic proposal on how to start a robust regression was made by
John  Tukey  in  1984  (unpublished). 

*  fitting  straight  lines  * 

The eye-fitting of straight lines was studied by F. Mosteller, Andrew Siegel, E. Trapido,
and C. Youtz (Tech. Rep. 183, Siegel (with FM, ET, CY) 1981).

The  fitting  of  robust  straight-lines  by  a  variety  of  simple  subgroup-oriented  procedures 
has been compared by I. M. Johnstone and Paul Velleman (Velleman (with IMJ) 1985). A
supplement analyzing their numerical results more thoroughly was prepared by John Tukey
(1985o).

A  practical  approach  to  configural  polysampling  for  straight-line  regression  has  been 
pioneered by Fanny (Zambuto) O’Brien (Ph.D. Thesis 1984, Tech. Rep. 277, 278, 279, 282).
This approach replaces 2-dimensional numerical quadrature by finite summation of algebraic
expressions for a somewhat restricted family of joint-distribution shapes.

An  approach  by  combined  approximation  and  sampling  methods  to  short-cutting  the 
computational  difficulties  with  Fanny  O’Brien’s  approach  is  being  investigated  by  Ha  Nguyen 
(work in progress).


A  comment  on  Huber’s  review  of  projection  pursuit,  including  projection  pursuit 
regression,  was  published  by  John  Tukey  (1985m). 

*  strengthening  theoretical  understanding  * 

Change-of-variance sensitivities in regression analysis have been studied by P. Rousseeuw
and  Elvezio  Ronchetti  (Ronchetti  (with  PR)  1985). 

A  long-term  program  to  attack  the  difficult  questions  of  optimal  fitting  with  3  or  more 
parameters,  where  other  methods  do  not  seem  practical,  was  undertaken  by  Elvezio  Ronchetti 
in  1983,  using  a  small-sample-asymptotics  approach.  Early  steps  include  a  variational  equation 
for  bioptimal  estimates  of  a  single  parameter  (in  progress). 

Bounded-influence  inference  in  regression  has  been  studied  by  Elvezio  Ronchetti  (Tech. 
Rep.  257). 

Robust C(α)-type tests for linear models have been studied by Elvezio Ronchetti (Tech.
Rep.  258). 

*  week  * 

A regression week, where selected active workers in the field could interact effectively,
was held in Princeton 16-19 November 1983.

5d.  Analysis-of-variance  procedures 
(including  factorialization  and  multiple  comparisons) 

The  "analysis  of  variance",  as  applied  to  numbers  that  fall  naturally  into  a  factorial 
pattern,  sometimes  refers  to  the  general  idea  of  splitting  up  observed  responses  into  (a  common 
term,)  main  effects,  and  interactions  of  various  orders,  and  sometimes  to  a  specific  way  of 
reporting  summary  information  about  such  splittings  due  to  R.  A.  Fisher  (more  than  60  years 
ago).  Alongside  of  regression  procedures,  analysis-of-variance  procedures  are  certainly  among 
those  most  frequently  and  most  usefully  applied  to  data.  This  section  takes  the  more  general 
interpretation  and  is  divided  into: 


•  topics  within  classical  anova 


•  new  value-splittings 

•  anova as it ought to be

The  first  of  these  comes  closest  to  the  narrower  interpretation. 

*  topics  within  classical  anova  * 

Higher-rank  fits,  going  beyond  the  usual  fitting  of  additive  main  effects,  had  been 
discussed  by  many  authors,  including  D.  R.  McNeil  and  John  Tukey  (1975a).  Applications  of 
these  fits  to  demography  were  studied  by  Mary  Breckenridge  (Tech.  Rep.  143,  Breckenridge 
1983).

Robust  analogs  of  the  quantities  arising  in  the  classical  analysis  of  variance  were  studied 
by  Henry  Braun  and  D.  R.  McNeil  (1981). 

Multiple  comparison  procedures  seem  to  be  inevitably  needed  as  soon  as  main  effects 
involving  more  than  one  degree  of  freedom,  and  not  naturally  divided  into  single  degrees  of 
freedom,  are  to  be  dealt  with  (e.g.  Tukey  1949f).  As  the  thirty-odd  years  have  passed  since 
such procedures first became prominent, a variety of procedures have been developed by
several  authors.  In  1983,  H.  Braun  and  John  Tukey  (Tukey,  (with  HB)  1983c)  reported  on  a 
new  and  promising  procedure,  specially  adapted  to  the  third  (and  perhaps  the  second)  slice  of 
a  large  family  of  such  problems,  where  the  slices  can  be  roughly  characterized  by  these  verbal 
desires: 

1st  slice:  I  want  to  find  at  least  one  simultaneously  significant  comparison;  I’ll  be  glad  to 
have  more. 

2nd  slice:  Surely  this  experiment  measures  well  enough  for  some  comparisons  to  be 
significant; I want to find as many significant as I can.

3rd  slice:  Surely  this  experiment  measures  well  enough  to  find  many  significant 
comparisons;  I  want  to  find  all  that  are  really  different. 


In  addition  to  these  alternative  desires  for  significance,  there  is  a  parallel  desire  for  confidence, 
which  should  be  answered  quite  differently. 

*  new  value-splittings  * 

One  of  the  innovations  in  Exploratory  Data  Analysis  (Tukey  1977a)  and  its  preliminary 
editions  was  the  emphasis  on  median  polish  -  -  the  taking  out  iteratively,  in  various  directions, 
of  medians  -  -  as  a  way  of  splitting  up  the  given  values  that  appeared  in  a  factorially- 
patterned  table,  whether  row-by-column  or  more  complicated.  If  all  the  fibers  -  -  all  the 
one-dimensional  subarrays  -  -  in  such  a  table  have  zero  medians,  the  table  is  said  to  have 
median  balance.  (A  program  for  median  polishing  up  to  7-way  tables  was  prepared  in  1980.) 
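
For the two-way (row-by-column) case, the iterative removal of medians can be sketched as follows. This is a minimal illustration under stated assumptions, not the 7-way program mentioned above: it uses the ordinary median, a fixed number of sweeps, and a hypothetical function name `median_polish`.

```python
from statistics import median


def median_polish(table, sweeps=10):
    """Split a two-way table into common term, row effects, column effects, residuals."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [[float(v) for v in row] for row in table]
    common = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(sweeps):
        # Take the median out of each row and add it to that row's effect.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        # Re-center the column effects, moving their median into the common term.
        m = median(col_eff)
        common += m
        col_eff = [c - m for c in col_eff]
        # Take the median out of each column and add it to that column's effect.
        for j in range(n_cols):
            m = median([resid[i][j] for i in range(n_rows)])
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        # Re-center the row effects in the same way.
        m = median(row_eff)
        common += m
        row_eff = [r - m for r in row_eff]
    return common, row_eff, col_eff, resid


# A purely additive 3x3 table: 10 + row effect + column effect.
example = [[10 + r + c for c in (-2, 0, 2)] for r in (-1, 0, 1)]
common, rows, cols, residuals = median_polish(example)
```

At every stage the table equals common + row effect + column effect + residual; when the sweeps stop changing anything, every row and every column of the residuals has median zero, which is the median balance described above.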

By  1980,  the  advantages  of  modified  medians  -  -  when  we  have  an  even  number  of 
values  to  median,  we  have  some  choice  along  the  interval  joining  the  two  central  values  -  - 
began to be clear. (The lomedian is, in such a case, the lower of the two central values.)
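
A small sketch of the lomedian as just defined; the function name `lomedian` is hypothetical. For an odd number of values it coincides with the ordinary median.

```python
def lomedian(values):
    """Return the lower of the two central values (or the middle value if the count is odd)."""
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 2]


low_central = lomedian([4, 1, 3, 2])   # central values are 2 and 3; the lower is returned
middle = lomedian([5, 1, 3])           # odd count: the ordinary median
```

Python's standard library offers the same rule as `statistics.median_low`.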

During 1980-83, Andrew Siegel studied the possibility of finding value-splittings with
"many (exact) zeroes" - - splittings such that each subtable has no more non-zero entries than it
has degrees of freedom. For the two-way table he found that many zeroes can always be had
in combination with minimum L1-norm and lomedian balance (Siegel 1983c). In 1985
Eugene Johnson (Ph.D. Thesis, forthcoming) showed that this is true for k-way tables with
any k.

During 1982-83, Andrew Siegel studied the use of the scaled non-central chi-square
distribution  with  zero  degrees  of  freedom  in  modelling  data  sets  containing  exact  zeroes  (Tech. 
Rep.  239,  Siegel  1985a). 

*  anova  as  it  ought  to  be  * 

The  original  Fisher  concept  -  -  make  the  analysis-of-variance  the  experimental  design 
called  for,  and  stare  at  the  table  of  mean  squares  -  -  took  us  a  long,  long  way.  In  1960,  B.  F. 
Green,  Jr.  and  John  Tukey  reported  on  various  extensions  of  this  approach,  as  exemplified  on  a 
psychometric example of P. Johnson and F. Tsao. The implications of some of these extensions
did not all become clear for two decades. In particular, for a long while the introduction of
an  incomplete,  well-tailored  down-sweeping  step  to  follow  the  upsweeping  corresponding  to 
the  standard  fitting  was  not  recognized  as  a  general  characteristic  to  be  sought  in  both 
robust/resistant  and  more  conventional  analysis  of  variance. 

In 1980-82, Allan Seheult and John Tukey (Tukey (with AS) 1982i) updated and
expounded a robust/resistant analysis for 2^n patterns of data (n factors, each in 2 versions or at
2 levels).

At  about  the  same  time,  Allan  Seheult  and  John  Tukey  recognized  that  a  robust  version 
of  the  upsweeping  process  could,  and  plausibly  should,  involve  (i)  a  many-zero  fitting,  (ii) 
identification  and  setting  aside  or  modification  of  idiosyncratic  entries  in  each  subtable 
(sometimes  called  "assassination")  and  (iii)  least-square  fitting  to  the  resulting  tables. 
(Unpublished work at Bell Laboratories by E. Fowles and J. McRae made useful contributions
here.)  (Though  still  unpublished,  this  work  had  important  influences  on  the  three  items  to  be 
next  reported.) 

In  1984-86,  Eugene  Johnson  studied  carefully  the  behavior  of  such  a  three-stage 
procedure.  He  found  a  Gaussian-slash  bi-efficiency  of  95%  achievable  by  such  a  procedure 
(forthcoming  Ph.D.  Thesis).  By  comparison  with  bi-efficiencies  of  centering  single  samples,  this 
seems  to  be  a  very  high  bi-efficiency  indeed. 

In 1985, Eugene Johnson and John Tukey (1986***) prepared an account, based in part
on Johnson’s tools developed in his thesis, of a graphical,
dissect-everything-into-single-degrees-of-freedom approach to a classical analysis of variance
that seemed to deserve the title "Graphical Exploratory Analysis of Variance". (The robust
analog will be presented in Eugene Johnson’s forthcoming Ph.D. Thesis.)

Starting  in  1985,  John  Tukey  is  developing  successive  drafts  of  a  long  paper  on  what  a 
good  computer  data-analysis  system  ought  to  do.  One  long  section  deals  with  approaches  to 
factorialization  (work  in  progress). 


5e.  Spectrum  analysis  procedures 

Modern  spectrum  analysis  is  thirty-odd  years  old  (cp  Tukey  1984b),  and  has  seen  many 
steps  forward. 

The  problem  of  assessing  the  underlying  power  spectrum  of  a  time  series  when  our 
observations  are  contaminated  by  "spiky  noise"  was  studied  by  Michael  Schwarzchild  (Ph.D. 
Thesis  1975)  with  encouraging  results. 

A  discussion  of  the  "styles"  of  spectrum  analysis  was  prepared  and  published  by  John 
Tukey (1984a).

The  relationship  between  the  two  most  important  of  these  styles:  classical  spectrum 
analysis  -  -  the  method  of  choice  for  Gaussianly  distributed  time  series  -  -  and  maximum 
entropy  spectrum  analysis  (and  related  procedures)  -  -  the  method  of  choice  for  some  very 
non-Gaussian  time  series  -  -  was  studied,  with  interesting  and  helpful  results,  by  Clifford 
Hurvich  (Ph.D.  Thesis  1985,  Hurvich  1985,  1986). 

5f.  Randomization  in  experimentation 

The  use  of  randomization  in  the  assignment  of  individuals  (plots,  runs,  people)  to 
circumstances  (treatments,  conditions,  etc.),  coupled  with  an  analysis  of  the  results  through 
the  use  of 

i)  simple, mathematically easily manipulable summaries

ii)  randomization moments for these summary statistics obtained by formula-manipulation
mathematics, and

iii)  a hoped-for (and often obtained) approach to normality of distribution for those
statistics

is  classical  (much  is  about  a  half  century  old).  This  approach  severely  limited  the  summaries 
that could be used - - ruling out much of what robust/resistant techniques had to offer - - and
failed  to  take  advantage  of  modern  computing  capabilities. 


The introduction of a way of using, and analyzing the results of, randomization that was
compatible with modern computing seems to be due to D. R. Brillinger, L. W. Jones and John
Tukey (*Tukey (with DRB and LWJ) 1978g) (work for the Weather Modification Advisory
Board of the Department of Commerce). In this approach a balanced subset of all possible
randomizations is chosen, the data are analyzed as if each had been used, and the position,
among all these results, of the value of the key statistic for the actual randomization used is
employed to give a significance test available for any choice of key statistic. Under
ARO sponsorship, John Tukey has carried this work considerably further (Tukey 1985l,
Tukey 1986****).
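
The general shape of such a calculation can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of the cited papers: a simple random sample of re-randomizations stands in for the balanced subset, the design is a two-group complete randomization, and the names `randomization_p_value` and `median_difference` are hypothetical. The point of the sketch is that the key statistic can be anything computable, here a difference of medians.

```python
import random
from statistics import median


def randomization_p_value(responses, assignment, statistic, n_rerandomizations=999, seed=0):
    """Position of the observed statistic among statistics from re-randomized assignments."""
    rng = random.Random(seed)
    observed = statistic(responses, assignment)
    at_least_as_large = 1  # the randomization actually used counts as one of the set
    labels = list(assignment)
    for _ in range(n_rerandomizations):
        rng.shuffle(labels)  # another assignment with the same group sizes
        if statistic(responses, labels) >= observed:
            at_least_as_large += 1
    return at_least_as_large / (n_rerandomizations + 1)


def median_difference(responses, labels):
    """A resistant key statistic: median of treated minus median of control."""
    treated = [r for r, g in zip(responses, labels) if g == 1]
    control = [r for r, g in zip(responses, labels) if g == 0]
    return median(treated) - median(control)


responses = [10, 11, 12, 13, 14, 0, 1, 2, 3, 4]
assignment = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p_value = randomization_p_value(responses, assignment, median_difference)
```

Replacing `median_difference` by any other function of the responses and the labels leaves the rest of the calculation unchanged.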

5g.  Other  procedures 

A  somewhat  diverse  group  of  topics  belong  here,  falling  naturally  into  a  few  categories. 

*  goodness  of  fit  and  modified  chi-square  * 

During 1977-78 Henry Braun studied the question of testing for goodness-of-fit in the
presence  of  nuisance  parameters  (Braun  1980a). 

A modified chi-square, stemming from earlier work by Freeman and Tukey (*Tukey
(with MFF) 1950h), was suggested for use in Mosteller and Tukey (Tukey (with FM) 1977b). Later
work by John Tukey (in 1978-79) showed that this modification gave much closer
approximations to the moments of tabular chi-square than the classical form when a Poisson
approximation was appropriate (whenever the total sample size can be thought of as at least as
variable as a Poisson quantity). Later work by K. Larntz (*Larntz 1978) showed that in those
goodness-of-fit problems where there are many cells of equal expected size, and the total
sample was fixed, there are discomforts about the tail probabilities. An improved modification
has been proposed, and its study begun (by Roger Pinkham and John Tukey).
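
For orientation, the two statistics can be compared in a short sketch. The text above does not state the formula; the sketch assumes the familiar Freeman-Tukey deviate, sqrt(x) + sqrt(x + 1) - sqrt(4m + 1) for an observed count x with fitted count m, summed in squares. The improved modification mentioned last is not reproduced here. Function names are hypothetical.

```python
from math import sqrt


def pearson_chi_square(observed, expected):
    """Classical form: sum of (x - m)^2 / m over cells."""
    return sum((x - m) ** 2 / m for x, m in zip(observed, expected))


def freeman_tukey_chi_square(observed, expected):
    """Assumed modified form: sum of squared Freeman-Tukey deviates."""
    return sum(
        (sqrt(x) + sqrt(x + 1) - sqrt(4 * m + 1)) ** 2
        for x, m in zip(observed, expected)
    )


counts = [3, 0, 7, 2]
fitted = [3.0, 3.0, 3.0, 3.0]
classical = pearson_chi_square(counts, fitted)
modified = freeman_tukey_chi_square(counts, fitted)
```

Unlike the classical form, the assumed modified form involves no division by the fitted count, so cells with very small fitted values do not blow up.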

*  nominal-by-ordinal  contingency  tables  * 

A technical report on this subject by Gary Simon was revised and issued in 1974 (Tech.
Rep. 32, Simon 1974).


*  jackknifing  discontinuous  or  nearly  discontinuous  statistics  * 

John  Tukey  has  recently  developed  a  multiple  split-half  jackknife  which  promises  good 
behavior  for  very  uncomfortable  cases  (unpublished  Statistics  411  notes). 

*  consistent  estimation  in  certain  stochastic  processes  * 

P. Sampson and Andrew Siegel (Siegel (with PS) 1985b) have investigated consistent estimation
in  partially  observed  random  walks. 

*  combination  of  results  * 

Frederick  Mosteller  and  John  Tukey  have  spent  considerable  effort  on  the  bundle  of 
problems  involved  in  the  many  kinds  of  combination  of  results.  Two  papers  (Tukey  (with 
FM)  1982a  and  1982e)  have  been  published,  and  considerable  material  directed  toward  a  book 
has  been  accumulated. 

5h.  Supporting  technology 

*  random  search  in  global  optimization  * 

Work  on  this  topic  by  Peter  Bloomfield  (with  RSA)  was  reported  in  1976  (Bloomfield 
1976). 

*  random integration methods unbiased for low-order polynomials  *

Andrew Siegel and Fanny (Zambuto) O’Brien have developed random few-point designs
for integration over rectangles that are exact for low-degree polynomials and unbiased for any
integrable function (Tech. Rep. 226, Siegel (with FZ) 1983a, Siegel (with FZO) 1984a).

*  approximations  to  standard  distributions  * 

Starting  from  his  co-editorial  responsibility  for  1985b,  Anita  Parunak  and  John  Tukey 
have been working on the development of a satisfactory continuous approximation to the tail
area of the hypergeometric distribution. (This has a convenience-only use in avoiding
calculation of factorials for possibly large numbers, and a non-replaceable use as a
continuous-parameter distribution with which to approximate various discrete distributions.)

John Tukey is part way through the development of a good continuous approximation
(same comment) to the F-distribution.

*  antithetic  variates  in  experimental  sampling  * 

Andrew  Siegel  has  found  a  method  of  this  sort  (unpublished). 

*  orthogonal  arrays  in  experimental  sampling  * 

The  use  of  orthogonal  arrays  as  a  source  of  balance  and  thus  of  improved  variance  in 
experimental  sampling  has  been  studied  by  Dhammika  Amaratunga  (Thesis,  1984). 

5i.  Probability  questions 

While  diverse,  many  of  these  problems  were  suggested  by  statistical  or  data-analytic 
questions. 


*  times  to  extinction  for  branching  processes  * 

Methods for bounding the distributions of these times were sought by Henry Braun, and
reported  (Tech.  Rep.  106,  Braun  1975  and  1978)  as  "Polynomial  bounds  for  probability 
generating  functions". 


*  coverage  problems  on  the  circle  * 

Results  by  Andrew  Siegel  and  Lars  Holst  have  been  reported  (Siegel  and  Holst  1982d). 

*  breakage  problems,  with  application  to  species  abundance  * 

Results by Andrew Siegel and G. Sugihara have been reported (Tech. Rep. 215, Siegel
and Sugihara 1983b).


*  distances  between  all  pairs  of  points  of  a  random  set  * 

Andrew  Siegel  has  found  an  exact  integral  representation  for  the  distribution  of  these 
distances.  Various  asymptotic  results  are  then  easily  obtained  (unpublished). 

5j.  Other  topics 

This  is  the  catchall. 

*  miscellaneous  applications  * 

In  1977,  John  Tukey  talked  on  what  modern  statistical  techniques  might  do  for 
forecasting  (unpublished). 

In  1981,  A.  J.  Arnold  and  Andrew  Siegel  studied  the  regularity  of  triple  points  of 
tectonic plates on the earth’s surface (Tech. Rep. 216).

BRIEF OUTLINE OF RESEARCH FINDINGS


1.  Ongoing  research 

Small sample asymptotics for regression. Bioptimal estimators for regression are available,
but their computation poses serious problems in more than two dimensions. One way to cope
with this problem is to restrict the search for the best estimator to the class of M-estimators.

By  means  of  small  sample  asymptotics  techniques,  one  can  approximate  the  mean  square 
error  of  an  M-estimator  under  two  (or  more)  sampling  situations.  Then  one  can  write  the 
variational  equation  that  the  bioptimal  or  polyoptimal  (in  the  sense  that  it  cannot  be  improved 
in  all  sampling  situations  simultaneously)  M-estimator  must  satisfy.  This  equation  has  to  be 
solved  numerically. 

Work  has  concentrated  on  the  problem  of  finding  the  optimal  M-estimator  for  location 
models  under  a  single  situation.  The  final  goal  of  this  project  is  to  derive  bioptimal  M- 
estimators for regression. (E. Ronchetti)

Robustness  in  nonlinear  models.  The  goal  of  this  project  is  to  derive  robust  procedures  for 
some  nonlinear  models.  We  begin  with  a  simple  problem. 

Let F(x, y) be a bivariate distribution, spherically symmetric, with center of symmetry at
the origin. The data consist of n iid observations (x_1, y_1), ..., (x_n, y_n) with distribution
F(x - ρ cos θ, y - ρ sin θ). The parameter of interest is θ. The constant ρ is assumed known
and determines the degree of nonlinearity in the problem.

The  inference  problem  described  is  invariant  under  rotations.  First  we  can  derive  the 
best  equivariant  estimator  under  a  single  situation.  Secondly,  by  applying  the  theory 
developed  for  location  and  linear  regression  models,  it  is  possible  to  compute  bioptimal  and 
polyoptimal  estimators  which  cannot  be  improved  in  all  sampling  situations  simultaneously. 
(E. Ronchetti; S. Morgenthaler, Dept. of Statistics, Yale University)

Aspects of the analysis of data from factorial designs. One set of work was with John
Tukey on graphical techniques for exploratory analysis of factorial data sets. These techniques
are extensions of the half-normal plots of Daniel and compare the ordered absolute
values of the normalized single-degree-of-freedom contrasts with typical values of order
statistics from the half-Gaussian distribution. Among other things, these techniques allow the
selection of apparently important contrasts, can indicate the need to reformulate the data, and
can indicate the need to select a different set of defining contrasts. Details are given in the
paper "Graphical exploratory analysis of variance". (E. Johnson, J. W. Tukey)
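
The basic comparison behind a half-normal plot can be sketched in a few lines. This is only the starting point that the techniques above extend, not the techniques themselves. The plotting positions (i - 0.5)/n are an assumption, as is the function name `half_normal_plot_points`; the "typical values" are approximated by half-Gaussian quantiles.

```python
from statistics import NormalDist


def half_normal_plot_points(contrasts):
    """Pair ordered absolute contrasts with approximate half-Gaussian order-statistic values."""
    n = len(contrasts)
    ordered = sorted(abs(c) for c in contrasts)
    standard_normal = NormalDist()
    # The half-Gaussian quantile at probability p is the Gaussian quantile at (1 + p) / 2.
    reference = [
        standard_normal.inv_cdf(0.5 + 0.5 * (i - 0.5) / n)
        for i in range(1, n + 1)
    ]
    return list(zip(reference, ordered))


# Seven normalized single-degree-of-freedom contrasts (made-up values for illustration).
points = half_normal_plot_points([0.3, -0.1, 4.0, 0.2, -0.5, 0.1, -0.2])
```

Contrasts that are mostly noise fall near a line through the origin when the pairs are plotted; a contrast lying well above that line, like the made-up value 4.0 here, is a candidate for an apparently important effect.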

Doctoral research of Eugene Johnson has been concerned with the robust analysis of data
from complete, unreplicated, factorial designs. My approach consists of a three-stage
procedure. The first stage consists of obtaining a resistant fit to the table where there are exactly
as many nonzero residuals as there are residual (highest order interaction) degrees of freedom.
Such a fit is called an elemental fit and the set of p (= rank of the design matrix) observations
corresponding to the zero residuals is called an elemental subset.

An important issue is the choice of the particular fitting procedure used to obtain the
elemental fit. If all goes well, the effects of any exotic observations will be confined to the
nonzero residuals. The next phase consists of screening the nonzero residuals from the
elemental fit for potential outliers by comparing the magnitudes of the ordered absolute residuals
with reference values from a half-Gaussian distribution. The third phase consists of cleaning
the data by removing any declared outliers, replacing them with missing-value estimates and
then conducting a least-squares analysis of the result with appropriate adjustment of the
constituent degrees of freedom. A major determinant of the ultimate performance of the entire
three-phase technique is the breakdown characteristics of the phase I fitting technique used.

A  successful  approach  is  to  use  a  variant  of  median  polish  in  which,  at  each  step,  a 
median  is  swept  from  the  set  of  values  which  are  candidates  for  estimating  a  given  parameter. 
By  proper  choice  of  the  order  of  fitting  and  of  the  type  of  median  to  be  used,  a  high  overall 
breakdown point can be achieved. Experimental sampling results indicate that a
Gaussian-slash biefficiency of 85% is achievable by applying the second and third phases to the
results of such an elemental fit. (E. Johnson)

Relating distributional systems. The 4- and 6-parameter g- and h- normal-transformation
families (Hoaglin, 1985) offer a considerable range of distributional shapes. A particular
advantage is that a distribution can be fit resistantly, directly to the data quantiles, avoiding
the use of highly variable 3rd and 4th sample moments to fit a non-normal shape. How close
are the g- and h-distributions to members of classical families, in particular the Pearson
family? Moment-matching (Martinez & Iglewicz, 1984) is poor for percentiles. This is a
serious flaw, in light of the resistant quantile-fitting philosophy underlying the g- and
h-families. Calculations of Jim Landwehr match percentiles to provide two g- and two h-
parameters to 5 S.F. for arrays of (√β1, β2) values among the Pearson Type IV, VI & VII
distributions.
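
As background for readers unfamiliar with these families, a quantile function for the 4-parameter case can be sketched as follows. The form used, location + scale * ((exp(g z) - 1) / g) * exp(h z^2 / 2) with z a standard Gaussian quantile, is the usual g-and-h transformation and is an assumption here, since the text above does not write it out; the 6-parameter family is not covered. The function name `g_and_h_quantile` is hypothetical.

```python
from math import exp
from statistics import NormalDist


def g_and_h_quantile(p, location=0.0, scale=1.0, g=0.0, h=0.0):
    """Quantile at probability p of an assumed 4-parameter g-and-h distribution."""
    z = NormalDist().inv_cdf(p)
    # g controls skewness; the g -> 0 limit of (exp(g z) - 1) / g is z itself.
    skewed = z if g == 0 else (exp(g * z) - 1.0) / g
    # h controls tail elongation; h = 0 leaves the tails Gaussian.
    return location + scale * skewed * exp(h * z * z / 2.0)


gaussian_upper = g_and_h_quantile(0.975)               # plain Gaussian quantile
stretched_upper = g_and_h_quantile(0.975, h=0.2)       # heavier tails push it outward
skewed_median = g_and_h_quantile(0.5, location=3.0, g=0.5, h=0.1)  # median stays at the location
```

Because the family is defined through its quantiles, fitting it to data quantiles is direct, which is the resistant fitting advantage noted above.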


The present effort uses Landwehr’s data to provide parsimonious algebraic expressions
mapping (√β1, β2) to the g- and h-parameters. Some intermediate results, e.g. of the form

g0 = [1 + ... + c β1) ...   [display only partly legible in the source]

are accurate to 3-4 S.F. Better precision is expected. The goal, an invertible transformation (or
pair of transformations)

(μ, σ², β1, β2)  <->  (μ, σ², g0, h0, (g2, h2))

will be acceptable if percentage points and parameters (moment-based and reshaping - g’s and
h’s) match well.


The algebraic forms of the transformations are unknown. Finding them involves generalizations
of existing techniques for two-way tables (Emerson & Wong, 1985): triangular non-additive
tables of g- and h-parameters (indexed by √β1 and β2) are transformed to square additive
tables. Iterative proportional fitting of a generalized additive model, an approach outlined
by J. W. Tukey, is being developed to provide an invertible mapping R^2 -> R^2. Given a
functional form, the function parameters may be optimized directly. The ACE algorithm is a
related technique for comparison. (Colin Goodall)

Statistics  of  change  in  size  and  shape  for  landmark  data.  Consider  a  sample  of  biological 
organisms,  or  other  geometrical  forms,  each  described  by  the  same  set  of  homologous  point 
landmarks.  The  landmark  co-ordinates  are  measured  at  two  different  times,  in  between 
which  growth  may  have  occurred.  Statistical  analysis  of  change  in  size  of  each  form  is  based 
on  a  univariate  summary,  e.g.  ratio  of  mean-square  dispersion  about  the  centroid  or  ratio  of 
areas.  Statistical  analysis  of  change  in  shape  involves  a  multivariate  statistic,  essentially  the 
residuals from a fit of translation, isotropic scale, and rotation (similarity transformation)
to each pair of forms. Multivariate tests, for example the one- and two-sample location
problems for mean shape change, may be based on this statistic. In Goodall (1986) I propose, in
general terms, a model comprising a continuously-varying deformation tensor field with
explicit components for measurement error, shape change, and inter-individual variation. I also
elucidate the connection between the least-squares fit of an affine transformation to a triangle
of landmarks and Bookstein’s geometrical method (op. cit.). In later work I develop a classical
multivariate  approach  to  the  analysis  of  shape  change  in  samples  of  pairs  of  forms  based  on 
Procrustes  methods  for  least-squares  fitting.  Statistical  properties  are  derived  via  perturbation 
analysis.  (Colin  Goodall) 
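
For planar landmarks, the least-squares similarity fit described above has a compact closed form when points are written as complex numbers. The sketch below illustrates that fit for a single pair of forms only; it is not the multivariate procedure of the later work, and the function name `fit_similarity` is hypothetical.

```python
import cmath


def fit_similarity(before, after):
    """Least-squares translation, isotropic scale and rotation taking `before` to `after`.

    Landmarks are (x, y) pairs listed in the same homologous order in both forms.
    Returns (scale, angle in radians, translation as a complex number, residuals).
    """
    z = [complex(x, y) for x, y in before]
    w = [complex(x, y) for x, y in after]
    n = len(z)
    z_centroid = sum(z) / n
    w_centroid = sum(w) / n
    z0 = [p - z_centroid for p in z]
    w0 = [q - w_centroid for q in w]
    # beta = scale * exp(i * angle) minimizes the sum of |w0 - beta * z0|^2.
    beta = sum(p.conjugate() * q for p, q in zip(z0, w0)) / sum(abs(p) ** 2 for p in z0)
    residuals = [q - beta * p for p, q in zip(z0, w0)]
    translation = w_centroid - beta * z_centroid
    return abs(beta), cmath.phase(beta), translation, residuals


# A triangle rotated by 90 degrees, doubled in size and shifted by (3, 4).
first_time = [(0, 0), (1, 0), (0, 1)]
second_time = [(3, 4), (3, 6), (1, 4)]
scale, angle, shift, shape_residuals = fit_similarity(first_time, second_time)
```

The fitted scale serves as a univariate size-change summary, while the residuals carry what is left for the analysis of shape change; in this made-up example they vanish because the second form is an exact similarity image of the first.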

Engineering  methods  applied  to  models  of  plant  growth.  Ongoing  research  at  Stanford 
University  suggests  that  morphological  development  in  plants  may  be  mediated  in  an  essential 
way  by  the  interaction,  in  the  epidermis,  of  turgor  pressure,  global  geometry,  and  cell-specific 
directional reinforcement by cellulose microfibrils (Green and Poethig, 1984). Cell polarity
(reinforcement direction), cell division direction, and cell deformation appear closely
interrelated by genetically-determined "rules" of cell activity. A finite element model of plant
growth  has  such  rules  embedded  in  its  constitutive  equations  (Goodall,  1985).  Specific  details 
include  an  abstraction  of  cell  geometry  and  a  scheme  for  simulations  based  on  alternative  rule 
sets.  (Colin  Goodall) 

Smoothing. A number of drafts of a paper on "Thinking about smoothing" have been written,
and  a  Technical  Report  will  soon  be  prepared.  (J.  W.  Tukey) 

Clustering/dissecting. A number of generations of improvement have been conducted for
procedures for dissecting a set of points (usually in the plane) consisting of a mixture of 3
samples of 50, one from each of three spherical Gaussian distributions centered at the vertices
of an equilateral triangle. Performance, given only the 150 points (without distinction
between samples) and that 3 pieces are to be made, at mutual separations of 3.7σ is extremely
close to that which knowledge of the population will permit; at 3.2σ it is still close. The
problem of learning from the data how many pieces to seek has not been addressed.
(Katherine Hansen, J. W. Tukey)

Interactive  analysis.  Motivated  by  the  need  to  display  effectively  what  has  already  been 
tried  on  a  data  set,  John  Tukey  has  prepared  several  drafts  of  an  account  of  what  operations  a 
relatively  complete  computing  system  for  data  analysis  might  provide,  and  how  the  history  of 
the  analysis  might  be  summarized.  (J.  W.  Tukey) 


